Blog Post 4

LDA Topic Modelling

Mekhala Kumar


December 7, 2022


Initial steps

First, a few more common words were removed from the document-feature matrix so that they would not clutter the analysis.

articles_dfm<-readRDS(file = "Data/News_DFM.rds")
Document-feature matrix of: 1,157 documents, 22,370 features (99.21% sparse) and 3 docvars.
docs    solitary two-day fixture great britain france 1900 olympics prospects
  text1        1       1       1     1       1      1    1        9         3
  text2        0       0       0     0       0      0    0        2         0
  text3        0       0       0     2       2      0    0        3         0
  text4        0       0       0     0       0      0    0       10         0
  text5        0       0       0     0       0      0    0        3         0
  text6        0       0       0     2       1      0    0        2         0
docs    cricket's
  text1         2
  text2         0
  text3         0
  text4         0
  text5         0
  text6         0
[ reached max_ndoc ... 1,151 more documents, reached max_nfeat ... 22,360 more features ]
#textplot_wordcloud(articles_dfm, min_count = 50, random_order = FALSE)
articles_dfm <- dfm_remove(articles_dfm, c("said","also","says","can","just"), verbose = TRUE)
removed 5 features
#textplot_wordcloud(articles_dfm, min_count = 50, random_order = FALSE)
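
As a minimal sketch of the same clean-up step, the snippet below builds a tiny document-feature matrix and removes the same filler words. The two sentences are made up for illustration and are not from the news dataset; `toy_texts`, `toy_dfm`, `toy_clean` and `filler` are hypothetical names.

```r
library(quanteda)

# Toy texts (invented for illustration, not from the news data)
toy_texts <- c(
  text1 = "the coach said the team can win gold",
  text2 = "she also said the final was just the start"
)

# Tokenise and build a document-feature matrix
toy_dfm <- dfm(tokens(toy_texts))

# Remove corpus-specific filler words; words that are absent
# from the matrix (here "says") are simply ignored
filler <- c("said", "also", "says", "can", "just")
toy_clean <- dfm_remove(toy_dfm, pattern = filler)

# Which features were actually dropped?
removed <- setdiff(featnames(toy_dfm), featnames(toy_clean))
print(removed)
```

Comparing `featnames()` before and after is a quick way to confirm that only the intended words were removed.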

Semantic Network

For the semantic network, I limited the document-feature matrix to terms that appeared at least 15 times and in at least 25% of the documents. This left 48 terms, which I plotted.
Unsurprisingly, the network shows that most of the articles discuss India at the Olympics, since Indian newspaper articles were used. One major theme is the hockey team: the men’s team, led by the captain Manpreet Singh, placed third, its first Olympic medal in over four decades. Other significant terms include medals and medal colours, perhaps pertaining to victories by other Indian athletes; these may be observed more clearly through a topic model.

dfm_refined <- dfm_trim(articles_dfm, min_termfreq = 15)
dfm_refined <- dfm_trim(dfm_refined, min_docfreq = .25, docfreq_type = "prop")

fcm<- fcm(dfm_refined)
[1] 48 48
top_features <- names(topfeatures(fcm, 48))
fcm_refined <- fcm_select(fcm, pattern = top_features, selection = "keep")
[1] 48 48
size <- log(colSums(fcm_refined))
textplot_network(fcm_refined, vertex_size = size / max(size) * 3)
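
To illustrate why the two trimming steps do different jobs, here is a small sketch on invented documents. The thresholds (3 occurrences, 50% of documents) are scaled down for the toy data and are not the values used above; all object names are hypothetical.

```r
library(quanteda)

# Toy documents (invented): "medal" is frequent overall but
# appears in only one document
toy_texts <- c(
  d1 = "medal medal medal hockey india",
  d2 = "hockey india bronze",
  d3 = "javelin gold india",
  d4 = "hockey captain india"
)
toy_dfm <- dfm(tokens(toy_texts))

# Step 1: keep terms that occur at least 3 times in total
step1 <- dfm_trim(toy_dfm, min_termfreq = 3)

# Step 2: keep terms that occur in at least half of the documents
step2 <- dfm_trim(step1, min_docfreq = 0.5, docfreq_type = "prop")

print(featnames(step1))  # "medal" survives the total-count filter
print(featnames(step2))  # "medal" is dropped by the document filter

# Feature co-occurrence matrix of the surviving terms
toy_fcm <- fcm(step2)
print(dim(toy_fcm))
```

The total-count filter alone would keep a term that is repeated many times in a single article, so the document-proportion filter is what restricts the network to terms that are common across the corpus.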

Topic Modelling

I used the topicmodels package to run a Latent Dirichlet Allocation topic model.

Preparatory steps

To run the model, the data had to be in the form of a document-term matrix. First, the document-feature matrix was converted into a one-token-per-document-per-row table, and then this table was converted into a document-term matrix.

# A tibble: 201,774 × 3
   document term     count
   <chr>    <chr>    <dbl>
 1 text1    solitary     1
 2 text214  solitary     1
 3 text245  solitary     1
 4 text629  solitary     1
 5 text639  solitary     1
 6 text797  solitary     1
 7 text1099 solitary     1
 8 text1    two-day      1
 9 text311  two-day      1
10 text368  two-day      1
# … with 201,764 more rows
news_dtm <- articles_tidy %>%
  cast_dtm(document, term, count)
<<DocumentTermMatrix (documents: 1157, terms: 22365)>>
Non-/sparse entries: 201774/25674531
Sparsity           : 99%
Maximal term length: 84
Weighting          : term frequency (tf)
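
The conversion can be sketched end to end on a toy matrix. This assumes the tidy table was produced with `tidy()` from tidytext, which is consistent with the `document`, `term` and `count` columns shown above; the toy texts and object names are hypothetical.

```r
library(quanteda)
library(tidytext)
library(dplyr)

# Toy document-feature matrix (invented texts)
toy_dfm <- dfm(tokens(c(
  text1 = "hockey bronze hockey",
  text2 = "javelin gold"
)))

# One row per document-term pair, with columns document, term, count
toy_tidy <- tidy(toy_dfm)
print(toy_tidy)

# Cast the tidy table into a DocumentTermMatrix for topicmodels
toy_dtm <- toy_tidy %>%
  cast_dtm(document, term, count)
print(toy_dtm)
```

Only non-zero counts appear in the tidy table, which is why its row count matches the number of non-sparse entries reported for the document-term matrix.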

Word topic probabilities

3 topics

Keeping only three topics did not provide much useful information.
The first topic seemed to be about Neeraj Chopra winning the gold medal in the javelin throw and PV Sindhu winning the bronze medal in badminton, and the second topic seemed to focus on hockey. The third topic contained only common terms pertaining to the Olympics.
Hence, I ran a search_k function to find a number of topics that would support a more meaningful analysis.

news_lda <- LDA(news_dtm, k = 3, control = list(seed = 2345))
A LDA_VEM topic model with 3 topics.
#extracting per-topic-per-word probabilities
news_topics <- tidy(news_lda, matrix = "beta")
# A tibble: 67,095 × 3
   topic term         beta
   <int> <chr>       <dbl>
 1     1 solitary 5.64e- 5
 2     2 solitary 1.41e-10
 3     3 solitary 1.26e- 5
 4     1 two-day  5.31e- 5
 5     2 two-day  1.65e- 5
 6     3 two-day  8.99e- 6
 7     1 fixture  4.24e- 5
 8     2 fixture  1.49e-32
 9     3 fixture  3.13e- 5
10     1 great    2.53e- 3
# … with 67,085 more rows
#Finding the top 10 terms
news_top_10 <- news_topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>% 
  ungroup() %>%
  arrange(topic, -beta)

#Plotting the top 10 terms per topic
news_top_10 %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered()

Choosing K

Based on the semantic coherence, I selected K as 25.
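
The code for this search is not shown in the post. As a rough illustration of comparing candidate values of K, the sketch below uses held-out perplexity from topicmodels rather than semantic coherence, so it is a different criterion from the one used here and is not the procedure that produced K = 25. It runs on a small slice of the `AssociatedPress` data that ships with topicmodels; the split sizes and candidate values are arbitrary assumptions.

```r
library(topicmodels)

# Example data bundled with topicmodels (not the news corpus)
data("AssociatedPress", package = "topicmodels")
ap_small <- AssociatedPress[1:60, ]

# Simple train / held-out split (sizes chosen arbitrarily)
train_dtm <- ap_small[1:45, ]
test_dtm  <- ap_small[46:60, ]

# Fit one model per candidate K and score it on the held-out documents
candidate_k <- c(2, 4, 6)
held_out_perplexity <- sapply(candidate_k, function(k) {
  fit <- LDA(train_dtm, k = k, control = list(seed = 2345))
  perplexity(fit, newdata = test_dtm)
})

print(data.frame(k = candidate_k, perplexity = held_out_perplexity))
```

Lower perplexity indicates a better fit to unseen documents, but it does not always agree with how interpretable the topics are, which is one reason a coherence-based measure may be preferred.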

Fitting the Latent Dirichlet Allocation topic model for 25 topics

At first I plotted the top 10 words for each topic; however, I thought that having the top 20 words would give a better idea of the content covered.

25 topics

Many of the topics (2, 5, 8, 12, 18, 20) focus on hockey. However, different aspects of the game are discussed.
Topic 2 discusses the Prime Minister honouring the hockey players with India’s highest sports award, the Major Dhyan Chand Khel Ratna.
Topic 5 is a little unclear, as it has information about marketing but also mentions the police.
Topic 8 covers the details of the hockey games and mentions the other countries that took part in the men’s semifinal and final matches.
Topic 12 can be considered the reaction of the public to the game, including the content of tweets about it.
Topic 18 has a mix of the hockey team’s achievements and the achievement of PV Sindhu (the badminton player).
Topic 20 discusses how the government of the state of Odisha reacted to the win and stated that it would continue to sponsor the women’s and men’s hockey teams.

All of the players mentioned in the topic models were either winners or in the final rounds of their respective sports. The topics also contained information regarding tweets and prominent celebrities that congratulated them. There was no noticeable difference in the top terms used for male and female sports players. However, one difference that can be observed is that although both the men’s and women’s Indian hockey teams played well, the majority of the hockey topics were about the men’s achievements (2, 8, 12 and 18), perhaps because the men’s team placed third whereas the women’s team placed fourth.
Similar to hockey, the sports of javelin throw, wrestling and weightlifting were mentioned in several topics. For the gold medallist Neeraj Chopra, there was also a topic that pertained to his army background. Finally, many of the topics about the winners mention cash prizes from sources such as the government, along with the term ‘rs’, which stands for rupees, the Indian currency.
The one theme that was popular outside the Indian context (topics 1 and 15) concerned the gymnast Simone Biles and the importance of mental health, as she had withdrawn from competition at the Olympics due to mental health concerns.

news_lda25 <- LDA(news_dtm, k = 25, control = list(seed = 2345))
A LDA_VEM topic model with 25 topics.
#extracting per-topic-per-word probabilities
news_topics25 <- tidy(news_lda25, matrix = "beta")
# A tibble: 559,125 × 3
   topic term          beta
   <int> <chr>        <dbl>
 1     1 solitary 3.47e- 25
 2     2 solitary 5.65e-233
 3     3 solitary 4.77e-232
 4     4 solitary 4.00e-231
 5     5 solitary 9.70e-232
 6     6 solitary 2.56e-231
 7     7 solitary 2.56e-233
 8     8 solitary 2.46e-231
 9     9 solitary 2.52e-  4
10    10 solitary 8.94e- 34
# … with 559,115 more rows
#Finding the top 10 terms
news_top_10_25 <- news_topics25 %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>% 
  ungroup() %>%
  arrange(topic, -beta)

#Plotting the top 10 terms per topic
news_top_10_25 %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered()

#Finding top 20 terms
news_top_20_25 <- news_topics25 %>%
  group_by(topic) %>%
  slice_max(beta, n = 20) %>% 
  ungroup() %>%
  arrange(topic, -beta)

#Plotting the top 20 terms per topic
news_top_20_25 %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered()

Since many of the topics had “olympics” or “India” among their top terms, I wanted to check whether removing these terms would offer deeper insight into the topics.

#removing some of the common words and then seeing how the topic model looks
articles_dfm_common_rem <- dfm_remove(articles_dfm, c("olympics","olympic","india","indian","tokyo","sports","#tokyo2020","2020","2021","india's"), verbose = TRUE)
removed 10 features
# A tibble: 195,993 × 3
   document term     count
   <chr>    <chr>    <dbl>
 1 text1    solitary     1
 2 text214  solitary     1
 3 text245  solitary     1
 4 text629  solitary     1
 5 text639  solitary     1
 6 text797  solitary     1
 7 text1099 solitary     1
 8 text1    two-day      1
 9 text311  two-day      1
10 text368  two-day      1
# … with 195,983 more rows
news_dtm2<- articles_tidy2 %>%
  cast_dtm(document, term, count)
<<DocumentTermMatrix (documents: 1157, terms: 22355)>>
Non-/sparse entries: 195993/25668742
Sparsity           : 99%
Maximal term length: 84
Weighting          : term frequency (tf)

Topic Models with some common words removed

Most of the topics regarding the prominent sports players remained the same.
This model did make it clearer why the word police occurred in one of the topics about hockey. Topic 15 in this model includes words such as casteist, women’s and hockey, which refer to the incident in which casteist remarks were made about women hockey players after the women’s team lost a semifinal.
Moreover, Topic 24 has information that is not related to the Olympics at all, which may indicate that some of the news articles in the dataset covered multiple headlines and were mixed in with the Olympics news.

news_lda25_remove <- LDA(news_dtm2, k = 25, control = list(seed = 2345))
A LDA_VEM topic model with 25 topics.
#extracting per-topic-per-word probabilities
news_topics25_remove <- tidy(news_lda25_remove, matrix = "beta")
# A tibble: 558,875 × 3
   topic term         beta
   <int> <chr>       <dbl>
 1     1 solitary 8.63e- 5
 2     2 solitary 2.06e- 4
 3     3 solitary 6.96e-74
 4     4 solitary 6.74e- 5
 5     5 solitary 3.72e-74
 6     6 solitary 9.11e-74
 7     7 solitary 4.09e-74
 8     8 solitary 4.43e-74
 9     9 solitary 7.81e-74
10    10 solitary 3.93e-74
# … with 558,865 more rows
#Finding the top 20 terms
news_top_20_25_remove <- news_topics25_remove %>%
  group_by(topic) %>%
  slice_max(beta, n = 20) %>% 
  ungroup() %>%
  arrange(topic, -beta)

#Plotting the top 20 terms per topic
news_top_20_25_remove %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered()

Greatest difference between 2 topics

In this LDA model, the topics about hockey each cover observably different aspects of the game. However, for the two topics about Neeraj Chopra in the javelin throw event (topics 11 and 25), the difference between them is not clear. Hence, I wanted to check which words differ most between the 2 topics. I kept only terms whose beta value was greater than 0.005 in at least one of the two topics.
In topic 11, the common words seem to be more general, such as the act of winning and being in the finals, whereas in topic 25 the common words are more specific to Neeraj Chopra and how his win was a historic moment in Indian sports.
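
The log-ratio comparison can be checked by hand on a toy table. The beta values below are invented so that the arithmetic is easy to follow; only the column layout and the 0.005 filter mirror the analysis.

```r
library(dplyr)
library(tidyr)

# Invented per-topic word probabilities for two topics
toy_beta <- tibble::tribble(
  ~topic, ~term,     ~beta,
  11,     "final",   0.020,
  25,     "final",   0.005,
  11,     "history", 0.004,
  25,     "history", 0.016,
  11,     "rare",    0.0001,
  25,     "rare",    0.0002
)

toy_ratio <- toy_beta %>%
  mutate(topic = paste0("topic", topic)) %>%
  # one row per term, one column per topic
  pivot_wider(names_from = topic, values_from = beta) %>%
  # keep terms that are reasonably common in at least one topic
  filter(topic11 > 0.005 | topic25 > 0.005) %>%
  # positive values lean towards topic 25, negative towards topic 11
  mutate(log_ratio = log2(topic25 / topic11))

print(toy_ratio)
```

Here "history" is four times more probable in topic 25 (log ratio of about 2), "final" is four times more probable in topic 11 (about -2), and "rare" is filtered out because it is uncommon in both topics, even though its ratio is large.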

#Different Neeraj Chopra topics
beta_11_25<- news_topics25_remove %>%
  mutate(topic = paste0("topic", topic))%>%
  pivot_wider(names_from =topic, values_from = beta)%>% 
  filter(topic11 > .005| topic25 > .005) %>%
  mutate(log_ratio = log2(topic25/ topic11))

[1] 2.164677
[1] -1.883413
testt <- ggplot(beta_11_25, aes(x = `term`, y = `log_ratio`, width = .5)) +
  geom_bar(position = "dodge", stat = "identity")

In the next post, I plan to implement structural topic models.